NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Looking Backward: Streaming Video-to-Video Translation with Feature Banks

Liang, Feng; Kodaira, Akio; Xu, Chenfeng; Tomizuka, Masayoshi; Keutzer, Kurt; Marculescu, Diana (April 2025, ICLR 2025 https://iclr.cc/virtual/2025/poster/32105)

Free, publicly-accessible full text available April 24, 2026
Aligning Large Multimodal Models with Factually Augmented RLHF

Sun, Zhiqing; Shen, Sheng; Cao, Shengcao; Liu, Haotian; Li, Chunyuan; Shen, Yikang; Gan, Chuang; Gui, Liangyan; Wang, Yu-Xiong; Yang, Yiming; et al (August 2024, Findings of the Association for Computational Linguistics (ACL Findings))

Full Text Available
SLoRA: Scalable Serving of Thousands of LoRA Adapters

Sheng, Ying; Cao, Shiyi; Li, Dacheng; Hooper, Coleman; Lee, Nicholas; Yang, Shuo; Chou, Christopher; Zhu, Banghua; Zheng, Lianmin; Keutzer, Kurt; et al (May 2024, Proceedings of Machine Learning and Systems 6 (MLSys 2024))

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is available at this https URL
more » « less
Full Text Available
ANODE: Unconditionally Accurate Memory-Efficient Gradients for Neural ODEs

https://doi.org/10.24963/ijcai.2019/103

Gholaminejad, Amir; Keutzer, Kurt; Biros, George (August 2019, International Joint Conferences on Artificial Intelligence)

Residual neural networks can be viewed as the forward Euler discretization of an Ordinary Differential Equation (ODE) with a unit time step. This has recently motivated researchers to explore other discretization approaches and train ODE based networks. However, an important challenge of neural ODEs is their prohibitive memory cost during gradient backpropogation. Recently a method proposed in arXiv:1806.07366, claimed that this memory overhead can be reduced from LNt, where Nt is the number of time steps, down to O(L) by solving forward ODE backwards in time, where L is the depth of the network. However, we will show that this approach may lead to several problems: (i) it may be numerically unstable for ReLU/non-ReLU activations and general convolution operators, and (ii) the proposed optimize-then-discretize approach may lead to divergent training due to inconsistent gradients for small time step sizes. We discuss the underlying problems, and to address them we propose ANODE, a neural ODE framework which avoids the numerical instability related problems noted above. ANODE has a memory footprint of O(L) + O(Nt), with the same computational cost as reversing ODE solve. We furthermore, discuss a memory efficient algorithm which can further reduce this footprint with a tradeoff of additional computational cost. We show results on Cifar-10/100 datasets using ResNet and SqueezeNext neural networks.
more » « less
Full Text Available
Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

Li, Zhuohan; Wallace, Eric; Shen, Sheng; Lin, Kevin; Keutzer, Kurt; Klein, Dan; Gonzalez, Joseph E. (January 2020, Proceedings of the International Conference on Machine Learning (ICML))
null (Ed.)
Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.
more » « less
Full Text Available
Domain Randomization and Pyramid Consistency: Simulation-to-Real Generalization Without Accessing Target Domain Data

https://doi.org/10.1109/ICCV.2019.00219

Yue, Xiangyu; Zhang, Yang; Zhao, Sicheng; Sangiovanni-Vincentelli, Alberto; Keutzer, Kurt; Gong, Boqing (October 2019, IEEE International Conference on Computer Vision)

We propose to harness the potential of simulation for the semantic segmentation of real-world self-driving scenes in a domain generalization fashion. The segmentation network is trained without any data of target domains and tested on the unseen target domains. To this end, we propose a new approach of domain randomization and pyramid consistency to learn a model with high generalizability. First, we propose to randomize the synthetic images with the styles of real images in terms of visual appearances using auxiliary datasets, in order to effectively learn domain-invariant representations. Second, we further enforce pyramid consistency across different “stylized” images and within an image, in order to learn domaininvariant and scale-invariant features, respectively. Extensive experiments are conducted on the generalization from GTA and SYNTHIA to Cityscapes, BDDS and Mapillary; and our method achieves superior results over the stateof- the-art techniques. Remarkably, our generalization results are on par with or even better than those obtained by state-of-the-art simulation-to-real domain adaptation methods, which access the target domain data at training time.
more » « less
Full Text Available
Large-batch Training for LSTM and Beyond

https://doi.org/10.1145/3295500.3356137

You, Yang; Hseu, Jonathan; Ying, Chris; Demmel, James; Keutzer, Kurt; Hsieh, Cho-Jui (January 2019, SC)

Full Text Available
A LiDAR Point Cloud Generator: from a Virtual World to Autonomous Driving

https://doi.org/10.1145/3206025.3206080

Yue, Xiangyu; Wu, Bichen; Seshia, Sanjit A.; Keutzer, Kurt; Sangiovanni-Vincentelli, Alberto L. (June 2018, ICMR '18 Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval)

3D LiDAR scanners are playing an increasingly important role in autonomous driving as they can generate depth information of the environment. However, creating large 3D LiDAR point cloud datasets with point-level labels requires a significant amount of manual annotation. This jeopardizes the efficient development of supervised deep learning algorithms which are often data-hungry. We present a framework to rapidly create point clouds with accurate pointlevel labels from a computer game. To our best knowledge, this is the first publication on LiDAR point cloud simulation framework for autonomous driving. The framework supports data collection from both auto-driving scenes and user-configured scenes. Point clouds from auto-driving scenes can be used as training data for deep learning algorithms, while point clouds from user-configured scenes can be used to systematically test the vulnerability of a neural network, and use the falsifying examples to make the neural network more robust through retraining. In addition, the scene images can be captured simultaneously in order for sensor fusion tasks, with a method proposed to do automatic registration between the point clouds and captured scene images. We show a significant improvement in accuracy (+9%) in point cloud segmentation by augmenting the training dataset with the generated synthesized data. Our experiments also show by testing and retraining the network using point clouds from user-configured scenes, the weakness/blind spots of the neural network can be fixed.
more » « less
Full Text Available
Counterexample-guided data augmentation

Dreossi, Tommaso; Ghosh, Shromona; Yue, Xiangyu; Keutzer, Kurt; Sangiovanni-Vincentelli, Alberto L; Seshia, Sanjit A. (July 2018, Proceedings of the ... International Conference on Artificial Intelligence)

We present a novel framework for augmenting data sets for machine learning based on counterexamples. Counterexamples are misclassified examples that have important properties for retraining and improving the model. Key components of our framework include a counterexample generator, which produces data items that are misclassified by the model and error tables, a novel data structure that stores information pertaining to misclassifications. Error tables can be used to explain the model's vulnerabilities and are used to efficiently generate counterexamples for augmentation. We show the efficacy of the proposed framework by comparing it to classical augmentation techniques on a case study of object detection in autonomous driving based on deep neural networks.
more » « less
Full Text Available
SqueezeDet: Unified, Small, Low Power Fully Convolutional Neural Networks for Real-Time Object Detection for Autonomous Driving

Wu, Bichen; Iandola, Forrest; Jin, Peter H; Keutzer, Kurt (December 2016, arXiv.org)

Object detection is a crucial task for autonomous driving. In addition to requiring high accuracy to ensure safety, object detection for autonomous driving also requires realtime inference speed to guarantee prompt vehicle control, as well as small model size and energy efficiency to enable embedded system deployment. In this work, we propose SqueezeDet, a fully convolutional neural network for object detection that aims to simultaneously satisfy all of the above constraints. In our network we use convolutional layers not only to extract feature maps, but also as the output layer to compute bounding boxes and class probabilities. The detection pipeline of our model only contains a single forward pass of a neural network, thus it is extremely fast. Our model is fully convolutional, which leads to small model size and better energy efficiency. Finally, our experiments show that our model is very accurate, achieving state-of-the-art accuracy on the KITTI [9] benchmark. The source code of SqueezeDet is open-source released1.
more » « less
Full Text Available

« Prev Next »

Search for: All records